In this project, I explore a data set of red wine quality and its physicochemical properties using the statistical software R. The objective is to find the physicochemical properties that distinguish good quality wines from lower quality ones. An attempt to build a linear model of wine quality is also shown.
First, I would like to understand the structure of the data set, so I used the summary and str functions. The data set consists of 1599 observations with 11 physicochemical properties as input variables and quality as the output. Wine quality is an ordered, discrete variable ranging from 3 to 8, with a mean of 5.6 and a median of 6.0. Each observation is identified by the X variable. According to the data set description, one pair of variables is dependent: free sulfur dioxide is a subset of total sulfur dioxide.
After this first look at the data set, I will plot each variable as a histogram to get a quick view of its distribution.
## [1] "=== Stats Summary of pH ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "=== Stats Summary of Density ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
pH and density appear to be approximately normally distributed. This is supported by their almost equal mean and median values, which indicate a symmetric distribution. The other variables are mostly long-tailed with a few outliers. I will replot the long-tailed distributions in log scale and compare them to the original plots along with their stats summaries.
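The mean-versus-median check and the log-scale comparison can be sketched as follows. This is a minimal illustration on simulated data, not the project's data: the vector `x` and the helper `skew_gap` are hypothetical names introduced here.

```r
# Toy data (assumption): a right-skewed sample standing in for a long-tailed variable
set.seed(42)
x <- rlnorm(1599, meanlog = 0, sdlog = 0.6)

# Hypothetical helper: gap between mean and median, in units of standard deviation.
# A value near zero is consistent with a symmetric distribution.
skew_gap <- function(v) (mean(v) - median(v)) / sd(v)

gap_raw <- skew_gap(x)          # original scale
gap_log <- skew_gap(log10(x))   # log10 scale

round(c(raw = gap_raw, log10 = gap_log), 3)

# Side-by-side histograms of the original and log10-transformed values
op <- par(mfrow = c(1, 2))
hist(x, breaks = 30, main = "Original scale", xlab = "x")
hist(log10(x), breaks = 30, main = "log10 scale", xlab = "log10(x)")
par(op)
```

A near-zero gap only suggests symmetry. It does not prove normality, so the histogram is still worth inspecting.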
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed acidity in log scale seems to be more normal.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile acidity in log scale is also more normal; however, it still looks slightly skewed.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Number of zero values"
##
## FALSE TRUE
## 1467 132
In the original plot, there is a suspiciously high count of zeros in citric acid. I wonder if these are true zeros or simply ‘not available’ values. A quick check using the table function shows that there are 132 observations with a value of zero and no NA values in the reported citric acid concentration. The citric acid concentration could have been too low to be significant and was therefore reported as zero. Replotting the citric acid distribution in log scale does not help normalize it, which could be due to the high count of zero values mentioned above.
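The zero-versus-NA check can be illustrated on a small made-up vector. The values below are an assumption for demonstration only. The sketch also shows why zeros cause trouble on a log scale: the logarithm of zero is `-Inf`, so those observations cannot be placed on a log axis.

```r
# Toy vector (assumption): a few exact zeros and no missing values
citric <- c(0, 0, 0.04, 0.26, 0.31, 0.49, 0, 0.56)

n_zero <- sum(citric == 0)      # number of exact zeros
n_na   <- sum(is.na(citric))    # number of missing values
table(citric == 0)              # same tabulation style as in the text

# Zeros have no finite logarithm
log_citric <- log10(citric)
n_nonfinite <- sum(!is.finite(log_citric))
```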
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The residual sugar distribution is more normal in log scale; however, it is still long-tailed.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Distribution of log(chlorides) is more normal than distribution of chlorides.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Replotting free sulfur dioxide distribution in log scale shows a bimodal distribution behaviour.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Distribution of total sulfur dioxide in log scale is more normal.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Distribution of log(sulphates) appears approximately normal.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Replotting alcohol in log scale does not normalize the distribution.
## [1] "=== Stats Summary ==="
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The distribution of quality in log scale is more skewed than in its original scale. In the original plot, we can see that the count of mid-range quality wines (quality of 5 and 6) is considerably higher than the others. This might become an issue when comparing low and high quality wines.
Wine quality ranges from 3 to 8 in this data frame. Since I am more interested in investigating what makes a higher quality wine, I will add a new variable, quality.rating, to categorise quality values of 3-4 as ‘bad’, 5-6 as ‘average’, and 7-8 as ‘good’.
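# Start with 'bad' (quality 3-4) versus 'average' (everything else)
rw$quality.rating <- ifelse(rw$quality < 5, 'bad', 'average')
# Promote quality 7-8 to 'good'
rw$quality.rating <- ifelse(rw$quality > 6, 'good', rw$quality.rating)
# Store as an ordered factor: bad < average < good
rw$quality.rating <- ordered(rw$quality.rating, levels = c('bad', 'average', 'good'))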
rw[1:20, 13:14]
## quality quality.rating
## 1 5 average
## 2 5 average
## 3 5 average
## 4 6 average
## 5 5 average
## 6 5 average
## 7 5 average
## 8 7 good
## 9 7 good
## 10 5 average
## 11 5 average
## 12 5 average
## 13 5 average
## 14 5 average
## 15 5 average
## 16 5 average
## 17 7 good
## 18 5 average
## 19 4 bad
## 20 6 average
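The same three-level rating could also be built in a single step with base R's `cut`. This is an alternative sketch, not the approach used in this report, and the `quality` vector below is toy data.

```r
# Toy quality scores (assumption) covering the full 3-8 range
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

data.frame(quality, rating)
```

Using explicit breaks keeps the bin boundaries in one place, which makes them easier to change later.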
As in the quality distribution, most wines fall into the average rating. This is likely to cause overplotting, so I will compare only the bad and good wines to find distinctive properties that separate the two.
The feature of interest is wine quality.
Based on the data description, I suspect that acids, residual sugar and sulfur dioxide affect the taste and hence the wine quality.
The quality.rating variable was created to group wine quality into three ratings: bad, average and good.
pH and density appear to be normally distributed. Other variables are mostly long-tailed with a few outliers. At this stage, I have not changed the format of the data.
In this section, I will compare pairs of variables and see if there is any correlation between them. I used the ggpairs function to spot obvious patterns between variables.
To better visualise the strong correlations between variables, I will use the corrplot function.
# Exclude variable X in correlation calculation
rw.noX <- subset(rw, select = -X)
rw.corr <- cor(rw.noX[sapply(rw.noX, is.numeric)])
corrplot(rw.corr, method = "number")
The ggpairs and corrplot functions above highlight the correlation between each pair of variables. I will only look into pairs with a correlation coefficient above 0.30 or below -0.30. Let’s put those data wrangling skills to use and show those pairs in a new data frame!
# Re-arrange the correlation table into long format
rw.corr.melt <- melt(rw.corr)
rw.corr.melt <- subset(rw.corr.melt,
                       value <= -0.30 | value >= 0.30)
rw.corr.melt <- subset(rw.corr.melt,
                       value != 1.0)
rw.corr.melt <- arrange(rw.corr.melt,
                        desc(value))
# Delete the repeated correlation (relies on both copies of a pair being adjacent after sorting)
rw.corr.melt <- rw.corr.melt[-seq(2,
                                  nrow(rw.corr.melt),
                                  by = 2), ]
rw.corr.melt$value <- round(rw.corr.melt$value, 3)
rw.corr.melt
## X1 X2 value
## 1 citric.acid fixed.acidity 0.672
## 3 density fixed.acidity 0.668
## 5 total.sulfur.dioxide free.sulfur.dioxide 0.668
## 7 quality alcohol 0.476
## 9 sulphates chlorides 0.371
## 11 density citric.acid 0.365
## 13 density residual.sugar 0.355
## 15 sulphates citric.acid 0.313
## 17 pH density -0.342
## 19 quality volatile.acidity -0.391
## 21 alcohol density -0.496
## 23 pH citric.acid -0.542
## 25 citric.acid volatile.acidity -0.552
## 27 pH fixed.acidity -0.683
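Dropping every second row works because a correlation matrix is symmetric, so each pair appears twice with the same value. An alternative sketch that avoids relying on row order is to keep only the upper triangle of the matrix. The data frame `toy` and its columns are made up for illustration, and this is not the method used above.

```r
# Toy data (assumption): b and c are built from a, d is unrelated
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.5),
                  c = -a + rnorm(n, sd = 0.5),
                  d = rnorm(n))

cm <- cor(toy)

# Keep each pair once (upper triangle excludes the diagonal) and apply the threshold
idx <- which(upper.tri(cm) & abs(cm) >= 0.30, arr.ind = TRUE)
pairs <- data.frame(var1  = rownames(cm)[idx[, "row"]],
                    var2  = colnames(cm)[idx[, "col"]],
                    value = round(cm[idx], 3))

# Sort from the strongest positive to the strongest negative correlation
pairs <- pairs[order(-pairs$value), ]
pairs
```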
Now that we have the list of variable pairs, let’s plot them!
Fixed acidity has a strong positive correlation to citric acid. In the description of data attributes, fixed acidity is defined as ‘most acids involved with wine or fixed or nonvolatile (do not evaporate easily)’. I wonder if this means citric acid is part of fixed acidity. If it is, other variables that correlate well to fixed acidity should also show some correlation to citric acid. A quick peek into the rw.corr.melt data frame seems to support this. I will discuss it further as I plot the rest of the graphs.
The plot above shows a strong positive correlation between fixed acidity and density. If our previous suspicion is true, we should also see some correlation between density and citric acid.
Citric acid is indeed correlated to density. Even though the correlation is not as strong as that of fixed acidity to density, the regression line suggests some linear relationship between the two variables. Now let’s find another variable that is correlated to fixed acidity and compare it to citric acid.
pH is very well correlated to both fixed acidity and citric acid. On second thought, the strong correlation of pH and density to both acids could just reflect common physical properties of acids. The strong correlation we saw between citric acid and fixed (tartaric) acidity could be the result of both being predominant fixed acids found in wine grapes (Nierman, 2004).
From the plot, we can see a clear relationship between free sulfur dioxide and total sulfur dioxide. This is consistent with free sulfur dioxide being a subset of total sulfur dioxide.
This plot is rather interesting. There is overcrowding at quality 5 and 6 due to the higher number of mid-range quality wines in the data set. However, if we compare the low quality (3-4) to the high quality (7-8) wines, there is a trend of increasing alcohol content from low to high wine quality.
Sulphates and chlorides seem to have some correlation; however, it is rather poor.
As a sugar solution is denser than water, density is expected to increase as residual sugar concentration increases. The plot above shows a rather weak correlation between these two variables.
There is a weak positive correlation between citric acid and sulphates.
This plot shows red wine with lower pH tends to have higher density.
The plot above shows a negative correlation between volatile acidity and wine quality. High quality red wines have lower volatile (acetic) acidity.
Density is expected to decrease as alcohol content increases, as the plot shows. In the fermentation process, sugar is turned into alcohol. The more alcohol produced, the less sugar remains, hence the lower density.
Volatile acidity shows a strong negative correlation to citric acid. This can be explained by the subsequent conversion of citric acid to acetic acid in winemaking practices involving malo-lactic bacteria (Shimazu et al., 1985).
Variables that show correlation to quality are alcohol and volatile acidity. Alcohol content has positive correlation to wine quality. On the other hand, volatile acidity is negatively correlated to quality.
pH to fixed acidity has the strongest relationship, which makes sense as pH is the scale to measure acidity. This is followed by fixed acidity to citric acid, fixed acidity to density and free sulfur dioxide to total sulfur dioxide.
I would also like to see if any particular variable in log scale has stronger correlation to quality. The comparison will be shown in a new data frame, ‘df.corr’.
## Correlation to quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
## Correlation (log scale) to quality
## fixed.acidity 0.11
## volatile.acidity -0.39
## citric.acid NaN
## residual.sugar 0.02
## chlorides -0.18
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.17
## density -0.18
## pH -0.06
## sulphates 0.31
## alcohol 0.48
## quality 1.00
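The new data frame above shows that transforming sulphates into log scale improves its correlation to quality (from 0.25 to 0.31). The same is observed for chlorides (from -0.13 to -0.18), although its correlation to wine quality is not as strong as that of sulphates. The log-scale correlation for citric acid is NaN because its zero values have no finite logarithm. Sulphates and chlorides will be converted to log scale for the rest of the analysis.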
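The NaN reported for citric acid can be reproduced on toy data. The sketch below uses simulated values, not the wine data, and shows two possible workarounds that are not part of this analysis: `log1p` (that is, log(1 + x)) and restricting the calculation to non-zero observations.

```r
# Toy data (assumption): two exact zeros among otherwise positive values
set.seed(7)
quality <- sample(3:8, 100, replace = TRUE)
citric  <- c(0, 0, runif(98, min = 0.01, max = 1))

cor_raw <- cor(citric, quality)          # finite
cor_log <- cor(log10(citric), quality)   # not a finite number: log10(0) is -Inf

# Possible workarounds (assumptions, not the author's method)
cor_log1p   <- cor(log1p(citric), quality)
nonzero     <- citric > 0
cor_nonzero <- cor(log10(citric[nonzero]), quality[nonzero])

round(c(raw = cor_raw, log = cor_log, log1p = cor_log1p, nonzero = cor_nonzero), 3)
```

Note that dropping the zeros changes the sample, so the resulting coefficient is not directly comparable to the others.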
I will now compare the original and log-scale variables to quality in a graph (to put writing a user-defined function into practice). The graphs are plotted side by side for comparison, with the graph on the right-hand side in log scale. The grid is split into three separate plots for better readability.
From the comparison between the original and log-scale plots, transforming sulphates and chlorides seems to improve the correlation slightly. This was also reflected when the distributions of sulphates and chlorides were replotted in the Univariate Plots Section: the distributions of log(sulphates) and log(chlorides) appeared more normal than their original plots.
In this section, I will mostly focus on variables that are well correlated to quality. They are alcohol content, volatile acidity and sulphates.
This agrees with the data attributes description that a high level of volatile acidity gives an unpleasant, vinegar taste and hence low wine quality. Also, good wines have higher alcohol content than bad wines.
The negative correlation between alcohol and density is consistent across all three ratings. The plot also shows that, holding density constant, bad rating wines have lower alcohol content than good rating wines. The slope also changes in steepness as the wine rating improves.
This plot shows better wine tends to have higher sulphates concentration. The range of sulphate concentration for a certain wine rating seems to be narrow.
Good wines have lower volatile (acetic) acidity than bad and average wines. Overall, pH does not seem to affect the quality rating. However, for volatile acidity below 0.5 g/L, better wines tend to have lower pH when volatile acidity is held constant.
According to its correlation coefficient, citric acid is only mildly correlated to wine quality. However, at low volatile acidity (<0.6 g/L), better wines tend to have a higher citric acid concentration.
To build the linear prediction model, I will use the variables with the highest correlation to wine quality.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates),
## data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) +
## citric.acid, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) +
## citric.acid + total.sulfur.dioxide, data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) +
## citric.acid + total.sulfur.dioxide + density, data = rw)
##
## ============================================================================================
##                       m1          m2          m3          m4          m5          m6
## --------------------------------------------------------------------------------------------
## (Intercept)           1.875***    3.095***    3.369***    3.444***    3.658***    1.627
##                       (0.175)     (0.184)     (0.184)     (0.196)     (0.201)     (12.006)
## I(alcohol)            0.361***    0.314***    0.303***    0.303***    0.290***    0.292***
##                       (0.017)     (0.016)     (0.016)     (0.016)     (0.016)     (0.020)
## volatile.acidity                  -1.384***   -1.156***   -1.217***   -1.176***   -1.181***
##                                   (0.095)     (0.097)     (0.112)     (0.112)     (0.116)
## log(sulphates)                                0.641***    0.659***    0.671***    0.669***
##                                               (0.077)     (0.079)     (0.078)     (0.080)
## citric.acid                                               -0.113      -0.075      -0.085
##                                                           (0.103)     (0.103)     (0.119)
## total.sulfur.dioxide                                                  -0.002***   -0.002***
##                                                                       (0.001)     (0.001)
## density                                                                           2.021
##                                                                                   (11.948)
## --------------------------------------------------------------------------------------------
## R-squared             0.227       0.317       0.345       0.346       0.353       0.353
## adj. R-squared        0.226       0.316       0.344       0.344       0.351       0.351
## sigma                 0.710       0.668       0.654       0.654       0.650       0.651
## F                     468.267     370.379     280.646     210.808     174.115     145.012
## p                     0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood        -1721.057   -1621.814   -1587.752   -1587.153   -1578.057   -1578.043
## Deviance              805.870     711.796     682.108     681.597     673.886     673.874
## AIC                   3448.114    3251.628    3185.503    3186.306    3170.114    3172.085
## BIC                   3464.245    3273.136    3212.389    3218.569    3207.754    3215.102
## N                     1599        1599        1599        1599        1599        1599
## ============================================================================================
The R-squared of the model is rather low, which could be due to the lack of a variable that shows strong correlation to wine quality. The prediction model seems to fit average rating wines better, which may be caused by the larger share of average wines in the data set.
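Building nested linear models one predictor at a time and comparing their fit can be sketched as follows. The data are simulated and the column `noise` is a hypothetical, deliberately uninformative predictor, so the numbers bear no relation to the table above.

```r
# Simulated data (assumption): quality depends on alcohol and volatile acidity only
set.seed(123)
n <- 500
toy <- data.frame(alcohol          = rnorm(n, mean = 10.4, sd = 1),
                  volatile.acidity = runif(n, min = 0.2, max = 1.0),
                  noise            = rnorm(n))
toy$quality <- 2 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.65)

# Nested models: each one adds a single predictor to the previous one
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + noise)

# Adjusted R-squared and AIC for each model
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
aic <- AIC(m1, m2, m3)

round(adj_r2, 3)
aic

# F-tests for whether each added predictor improves the fit
anova(m1, m2, m3)
```

Plain R-squared never decreases when a predictor is added, so adjusted R-squared, AIC or BIC are the more useful columns when deciding whether an extra variable earns its place.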
Low volatile acidity combined with high alcohol content and sulphates seem to make better wines.
Negative correlation between alcohol and density is consistent in all three ratings. The plot also shows bad rating wine has lower alcohol content compared to good rating wine.
This plot shows the influence of alcohol content and density on wine rating. The negative correlation between alcohol and density is consistent across all three wine ratings. It can be explained by the fermentation process in winemaking. Sugar content is directly proportional to density: higher sugar content leads to higher density. In the fermentation process, sugar is converted to alcohol. The more alcohol produced, the less sugar remains, hence the lower density. The slope changes in steepness as the wine rating improves. The plot also shows that, holding density constant, bad rating wines have lower alcohol content than good rating wines. The average rating wines are ignored when inferring relationships between variables because there are significantly more of them.
This chart reveals the influence of alcohol and sulphates concentration on red wine rating. It shows that better wines tend to have higher sulphates and alcohol concentrations. The range of sulphates concentration for a given wine rating seems to be small.
With an R-squared of 0.353, the linear model explains only about 35% of the variance in wine quality. The model appears to fit average rating wines better, which could be due to the high number of average rating wines in the data set. The low R-squared may also mean that other key properties that better predict wine quality are missing.
In this project, I was able to examine the relationships between physicochemical properties and identify the key variables that relate to red wine quality, which are alcohol content and volatile acidity. Some interesting relationships between variables could be explained scientifically, such as the relationship between alcohol content, residual sugar and density in the winemaking process. Data wrangling skills were put into practice to rearrange the data into a suitable format. The lack of a variable with strong correlation to wine quality and the large share of average rating wines proved to be problematic for the analysis: it was hard to tell if a true correlation was present. This also limits how accurate a prediction model can be. For future data exploration, it would be interesting to apply a different approach to building the model and to look into the evaluations made by each wine expert, as wine tasting is subject to individual preferences.